Machine Translation 4 Microblogs
Abstract
The emergence of social media has drastically changed the way information is published. In earlier eras the written word was dominated by formal registers; now that people from many different backgrounds can publish, written documents exhibit non-standard style, formality, content, genre and topic. One source of such data is posts on microblogs and social networks such as Twitter, Facebook and Sina Weibo. The people who publish these documents are not all professionals, yet the information they publish can be leveraged for many ends [Han and Baldwin, 2011; Hawn, 2009; Kwak et al., 2010; Sakaki et al., 2010]. However, current NLP systems perform poorly on this type of data, since they are modelled using traditional assumptions and trained on existing edited data. One such assumption is spelling homogeneity: we assume that there is only one way to spell tomorrow, whereas in microblogs this word can be abbreviated to tmrw (among many other options) or misspelled as tomorow. [Gimpel et al., 2011] show that using in-domain data and defining more domain-specific features can help address this problem for part-of-speech tagging. In this thesis, we address the challenge of NLP in the domain of informal online text, with emphasis on Machine Translation. The thesis makes the following contributions. (1) One problem is the lack of annotated datasets in this domain; we present a method to extract such data automatically from microblog posts, exploiting the fact that many bilingual users post translations of their own posts. (2) We propose a compositional model for word understanding based only on the character sequence of each word, breaking the assumption that different word types are independent. This allows the model to generalize better on morphologically rich languages and on the orthographically creative language used in microblogs. (3) Finally, we show improvements on several NLP tasks, both syntactically and semantically oriented, using both the crawled data and the proposed character-based models. Ultimately, these are combined into a state-of-the-art MT system in this domain.
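The spelling example in the abstract (tomorrow, tmrw, tomorow) can be made concrete with a small sketch. This is not the thesis's model, which the abstract describes only as a compositional model over character sequences. It is a deliberately crude stand-in, assuming character bigram overlap (Jaccard similarity) as the measure; the function names `char_ngrams`, `char_similarity` and `word_identity_similarity` are hypothetical. It only illustrates why treating word types as independent symbols discards information that the character sequence keeps.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`, with boundary markers."""
    padded = f"^{word.lower()}$"  # '^' and '$' mark word start and end
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def char_similarity(a, b, n=2):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def word_identity_similarity(a, b):
    """Similarity under the 'independent word types' assumption: all or nothing."""
    return 1.0 if a == b else 0.0


if __name__ == "__main__":
    # Compare the standard spelling against two microblog variants and an
    # unrelated word (the unrelated word is an illustrative choice).
    for variant in ["tomorow", "tmrw", "banana"]:
        print(
            variant,
            word_identity_similarity("tomorrow", variant),
            round(char_similarity("tomorrow", variant), 3),
        )
```

Under word identity, every variant is equally unrelated to tomorrow. Under the character-based measure, the misspelling tomorow scores high and the abbreviation tmrw still scores above an unrelated word. A learned character-level model would be expected to capture far more than this overlap heuristic does.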
Similar resources
Machine Translation in Microblogs
The emergence of social media caused a drastic change in the way information is published. The possibility for people with different backgrounds to publish information has caused non-standard style, formality, content, genre and topics to be present in documents that are published or texted. One such example are posts in microblogs and social networks, such as Twitter, Facebook and Sina Weibo. ...
Mining Parallel Corpora from Sina Weibo and Twitter
Microblogs such as Twitter, Facebook, and Sina Weibo (China’s equivalent of Twitter), are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” m...
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
Syntactic Normalization of Twitter Messages
The use of computer mediated communication such as emailing, microblogs, Short Messaging System (SMS), and chat rooms has created corpora which contain incredibly noisy text. Tweets, messages sent by users on Twitter.com, are an especially noisy form of communication. Twitter.com contains billions of these tweets, but in their current state they contain so much noise that it is difficult to ext...
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...